This analysis builds random forest models from inputs derived from a DEM and a set of landslide initiation points to predict which areas are more susceptible to landslides.

library(terra)
## terra version 1.4.22
library(TerrainWorksUtils)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(here)
## here() starts at C:/Work/Source/LandslideSusceptibility
library(ggmap)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
## 
## Attaching package: 'ggmap'
## The following object is masked from 'package:terra':
## 
##     inset
for (file in list.files(here("helper"))) {
  source(here("helper", file))
}

dataDir <- "E:/NetmapData/Scottsburg"
shade_file <- file.path(dataDir, params$shade_file)
varRasterFiles <- lapply(params$input_rasters, function(x) {
  file.path(dataDir, x)
})
initPointsFile <- file.path(dataDir, params$init_points_file)
noninitProportion <- params$noninit_proportion
bufferRadius <- params$buffer_radius
bufferExtractionMethod <- params$buffer_extraction
initRangeExpansion <- params$init_range_expansion
iterationsCount <- params$iterations_count

Inputs

shade_file: E:/NetmapData/Scottsburg/shade_scottsburg.tif
varRasterFiles: E:/NetmapData/Scottsburg/grad_15.tif, E:/NetmapData/Scottsburg/pca_15m_48hr.flt, E:/NetmapData/Scottsburg/pca_15m_5hr.flt, E:/NetmapData/Scottsburg/plan_15.tif, E:/NetmapData/Scottsburg/prof_15.tif
initPointsFile: E:/NetmapData/Scottsburg/Scottsburg_Upslope.shp
noninitProportion: 1
bufferRadius: 15
bufferExtractionMethod: center cell
initRangeExpansion: 100
iterationsCount: 5

The Data

Required inputs: a DEM for a region and a set of landslide initiation points. For this exploration, we use the Scottsburg dataset, collected near Scottsburg, OR, which includes a DEM of the region and 74 landslide initiation points collected in the field.

# get background map
terrain <- terra::rast(
  ggmap::get_map(c(-123.90, 43.5978, -123.80, 43.66), 
                 source = "osm", 
                 messaging = FALSE))
## Source : http://tile.stamen.com/terrain/13/1276/2989.png
## Source : http://tile.stamen.com/terrain/13/1277/2989.png
## Source : http://tile.stamen.com/terrain/13/1278/2989.png
## Source : http://tile.stamen.com/terrain/13/1276/2990.png
## Source : http://tile.stamen.com/terrain/13/1277/2990.png
## Source : http://tile.stamen.com/terrain/13/1278/2990.png
## Source : http://tile.stamen.com/terrain/13/1276/2991.png
## Source : http://tile.stamen.com/terrain/13/1277/2991.png
## Source : http://tile.stamen.com/terrain/13/1278/2991.png
# Get hillshade 
shade <- terra::rast(file.path(dataDir, "shade_scottsburg.tif"))

# Get points
initPoints <- terra::vect(initPointsFile)

# Match projections
terrain <- terra::project(terrain, crs(shade))


plotRGB(terrain)
plot(shade, col=grey(0:100/100), legend = FALSE, add= TRUE)
points(initPoints, cex = 2, col = "#c44a41")

Model inputs

Surface metrics are calculated from the DEM using the DEMutilities toolbox from the ForestedWetlands repo. Here, we use gradient, plan curvature, profile curvature, and elevation deviation at the 15-meter scale, and partial contributing area at the 15-meter scale calculated for 5-hour and 48-hour periods. This gives us six input rasters.

We also need a set of non-initiation training points. To derive a set of meaningful non-initiation points, we first exclude all points outside the range of values where landslides can be found, based on the range of input values across all landslide initiation points. This gives us our “analysis region.”

varRasterList <- lapply(varRasterFiles, function(file) terra::rast(file))

# Align variable rasters
varRasterList <- TerrainWorksUtils::alignRasters(shade, varRasterList)

# Combine variable rasters into a single multi-layer raster
varsRaster <- terra::rast(varRasterList)
initBuffers <- terra::buffer(initPoints, width = bufferRadius)

# Calculate initiation range for each variable
initRange <- createInitiationRange(
  varsRaster,
  initBuffers,
  initRangeExpansion
)
initRange
##              min         max
## grad_15        0   3.3639109
## pca_15m_48hr   0 169.2360535
## pca_15m_5hr    0   4.9194107
## plan_15        0   0.5859054
## prof_15        0   0.2814967
# Identify cells in the study region that have variable values within their 
# initiation ranges
initRaster <- terra::copy(varsRaster)
for (varName in names(initRaster)) {
  varRaster <- initRaster[[varName]]
  
  # Get variable value limits
  minInitValue <- initRange[varName, "min"]
  maxInitValue <- initRange[varName, "max"]
  
  # NA-out cells with values outside variable initiation range
  varInitRaster <- terra::app(varRaster, function(x) {
    ifelse(x < minInitValue | x > maxInitValue, NA, x)
  })
  
  # Update the raster in the input raster stack
  initRaster[[varName]] <- varInitRaster
}

# NA-out cells with ANY variable value outside its initiation range
analysisRegionMask <- all(initRaster)

When creating the analysis region, dev_15 (elevation deviation from the mean) has the least overlap with the other predictors. We need to set a high value for initRangeExpansion, or else most points will be removed. Here, we expanded the initiation range by 100 percent. The map below is colored by the number of variables that fall within the initiation range; only the orange cells are included in the analysis region.
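createInitiationRange comes from the sourced helper scripts, so its exact behavior isn't shown here. As a rough sketch, one plausible way to expand a variable's observed range by p percent of its span (the role initRangeExpansion plays) is:

```r
# Hypothetical illustration only: createInitiationRange is defined in the
# sourced helper scripts; this sketch just shows one way a percentage-based
# range expansion could work.
expandRange <- function(vals, expansionPct) {
  lo <- min(vals, na.rm = TRUE)
  hi <- max(vals, na.rm = TRUE)
  pad <- (hi - lo) * expansionPct / 100  # widen by this much on each side
  c(min = lo - pad, max = hi + pad)
}

expandRange(c(0.2, 0.5, 0.8), 100)  # span 0.6, so min = -0.4 and max = 1.4
```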

analysisCountRaster <- terra::app(initRaster, 
                                  function(x) sum(!is.na(x)))
plot(analysisCountRaster, 
     col = c(RColorBrewer::brewer.pal( 
       length(unique(values(analysisCountRaster)))-1, 
       "YlGnBu"),
       "#eb972a"
     ), 
     axes = FALSE) 

We then draw a buffer around initiation points and randomly sample non-initiation points from the remaining analysis region.
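generateNoninitiationBuffers is one of the sourced helper functions. A minimal sketch of the idea, assuming it samples random non-NA cells from the viable region and buffers them (the real helper may differ, e.g. it appears to retry sampling, given the spatSample warnings below), could be:

```r
library(terra)

# Hypothetical sketch; the real generateNoninitiationBuffers lives in the
# helper scripts.
sketchNoninitBuffers <- function(n, regionRaster, radius) {
  pts <- terra::spatSample(regionRaster, size = n,
                           na.rm = TRUE,       # only cells inside the region
                           as.points = TRUE)   # return sampled cells as points
  terra::buffer(pts, width = radius)           # circular buffer per point
}
```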

# Double the size of the initiation buffers
expInitBuffers <- terra::buffer(initPoints, 
                                width = bufferRadius * 2)

# Remove expanded initiation buffers from the viable non-initiation region
noninitRegion <- terra::copy(analysisRegionMask)
initCellIndices <- terra::extract(noninitRegion, 
                                  expInitBuffers, 
                                  cells = TRUE)$cell
noninitRegion[initCellIndices] <- NA

# Determine how many to generate
noninitBuffersCount <- ceiling(length(initPoints) * noninitProportion)

noninitBuffers <- generateNoninitiationBuffers(
  noninitBuffersCount,
  noninitRegion,
  bufferRadius
)
## Warning: [spatSample] fewer cells returned than requested

## Warning: [spatSample] fewer cells returned than requested

## Warning: [spatSample] fewer cells returned than requested
plotRGB(terrain)
plot(shade, col=grey(0:100/100), legend = FALSE, add= TRUE)
plot(analysisRegionMask, add = TRUE, col = "#377eb8")
points(initBuffers, col = "#fb8072")
points(noninitBuffers, col = "#a6d854")
legend("topright", 
       legend = c("Analysis Region", 
                  "Initiation Points", 
                  "Non-initiation Points"), 
       fill = c("#377eb8", NA, NA), 
       border = c(1, "transparent", "transparent"),
       col = c(NA, "#fb8072", "#a6d854"), 
       pch = c(NA, 16, 16), 
       pt.cex = c(1, 2, 2)
)

After generating the non-initiation points, we can extract input values from the rasters to create our training data.
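extractBufferValues is also a helper function. With bufferExtractionMethod set to "center cell", a plausible sketch is to read raster values at each buffer's centroid and attach the class label:

```r
library(terra)

# Hypothetical sketch of a "center cell" extraction; the real
# extractBufferValues helper may also support other extraction methods,
# such as averaging values across the whole buffer.
sketchExtract <- function(varsRaster, initBuffers, noninitBuffers) {
  initVals <- terra::extract(varsRaster, terra::centroids(initBuffers),
                             ID = FALSE)
  nonVals <- terra::extract(varsRaster, terra::centroids(noninitBuffers),
                            ID = FALSE)
  out <- rbind(cbind(initVals, class = "initiation"),
               cbind(nonVals, class = "non-initiation"))
  out$class <- factor(out$class)  # randomForest expects a factor response
  out
}
```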

# Get training data
trainingData <- extractBufferValues(
  varsRaster,
  initBuffers,
  noninitBuffers,
  bufferExtractionMethod
)

# Remove coordinates from data
coordsCols <- names(trainingData) %in% c("x", "y")  
trainingData <- trainingData[,!coordsCols]
trainingData_no_y <- trainingData[, !grepl("class", names(trainingData))]

cor(trainingData_no_y[trainingData$class == "initiation",])
##                  grad_15 pca_15m_48hr pca_15m_5hr    plan_15    prof_15
## grad_15       1.00000000  -0.04878078   0.6367217 -0.1783508  0.3083373
## pca_15m_48hr -0.04878078   1.00000000   0.2394472  0.8030709 -0.2153073
## pca_15m_5hr   0.63672171   0.23944724   1.0000000  0.1379310  0.1530577
## plan_15      -0.17835078   0.80307085   0.1379310  1.0000000 -0.3634900
## prof_15       0.30833727  -0.21530733   0.1530577 -0.3634900  1.0000000
cor(trainingData_no_y[trainingData$class == "non-initiation",])
##                 grad_15 pca_15m_48hr pca_15m_5hr    plan_15     prof_15
## grad_15      1.00000000   0.24955253  0.80663274 0.07310658  0.09531334
## pca_15m_48hr 0.24955253   1.00000000  0.46492961 0.73993783  0.01780135
## pca_15m_5hr  0.80663274   0.46492961  1.00000000 0.17415226 -0.02702628
## plan_15      0.07310658   0.73993783  0.17415226 1.00000000  0.06800867
## prof_15      0.09531334   0.01780135 -0.02702628 0.06800867  1.00000000
panel.hist <- function(x, ...) {
    usr <- par("usr")
    on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 1.5))
    his <- hist(x, plot = FALSE)
    breaks <- his$breaks
    nB <- length(breaks)
    y <- his$counts
    y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, ...)
    # lines(density(x), col = 2, lwd = 2) # Uncomment to add density lines
}
pairs(trainingData_no_y[trainingData$class == "initiation",], 
      upper.panel = NULL, 
      diag.panel = panel.hist, 
      main = "Initiation points")

pairs(trainingData_no_y[trainingData$class == "non-initiation",], 
      upper.panel = NULL, 
      diag.panel = panel.hist, 
      main = "Non-initiation points")

pairs(trainingData_no_y, 
      upper.panel = NULL, 
      diag.panel = panel.hist, 
      main = "All points", 
      col = ifelse(trainingData$class == "initiation", 1, 2))
legend("topright", 
       legend = c("initiation", "non-initiation"), 
       pch = 1, 
       col = c(1, 2))

Random forest model

Next, we fit a random forest model on the training data, and examine some model stats.

rfModel <- randomForest(
  formula = class ~ .,
  data = trainingData
)
plot(rfModel)

Error rate

errorRateDf <- data.frame(rfModel$err.rate[rfModel$ntree,])
colnames(errorRateDf) <- "error rate"
errorRateDf
##                error rate
## OOB            0.07432432
## initiation     0.09459459
## non-initiation 0.05405405

Confusion Matrix

rownames(rfModel$confusion) <- c("true initiation", "true non-initiation")
rfModel$confusion
##                     initiation non-initiation class.error
## true initiation             67              7  0.09459459
## true non-initiation          4             70  0.05405405

Variable importance

imp <- importance(rfModel)
imp
##              MeanDecreaseGini
## grad_15              4.805806
## pca_15m_48hr         9.432242
## pca_15m_5hr          6.562152
## plan_15             23.854228
## prof_15             28.760166
varImpPlot(rfModel)

Partial Dependence

These plots show the partial dependence of the model outcome on each variable, plotted in order of greatest to least importance. Higher values indicate that a point with those variable values is more likely to be classified as an initiation point.

vars_by_importance <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]
for (variable in vars_by_importance) {
  partialPlot(rfModel, 
              trainingData, 
              eval(variable), 
              xlab = variable, 
              main = paste("Partial Dependence on", variable), 
              ylim = c(-0.5, 4.5))
}

Non-initiation Points

Let’s see how the outcome varies depending on the non-initiation points.

iterations <- 1:params$iterations_count

trainingDataList <- lapply(iterations, function(x) {
  
  noninitBuffers <- generateNoninitiationBuffers(
    noninitBuffersCount,
    noninitRegion,
    bufferRadius
  )
  
  trainingData <- extractBufferValues(
    varsRaster,
    initBuffers,
    noninitBuffers,
    bufferExtractionMethod
  )
  
  # Remove coordinates from data
  coordsCols <- names(trainingData) %in% c("x", "y")  
  trainingData[,!coordsCols]
  
})
## Warning: [spatSample] fewer cells returned than requested

## Warning: [spatSample] fewer cells returned than requested

## Warning: [spatSample] fewer cells returned than requested

## Warning: [spatSample] fewer cells returned than requested

## Warning: [spatSample] fewer cells returned than requested

## Warning: [spatSample] fewer cells returned than requested

## Warning: [spatSample] fewer cells returned than requested

## Warning: [spatSample] fewer cells returned than requested

## Warning: [spatSample] fewer cells returned than requested

## Warning: [spatSample] fewer cells returned than requested

## Warning: [spatSample] fewer cells returned than requested

## Warning: [spatSample] fewer cells returned than requested

## Warning: [spatSample] fewer cells returned than requested

## Warning: [spatSample] fewer cells returned than requested

## Warning: [spatSample] fewer cells returned than requested
rfModelList <- lapply(trainingDataList, function(d) {
  randomForest(
    formula = class ~ .,
    data = d
  )
})

for (i in iterations) {
  cat("MODEL ", i)
  trainingData <- trainingDataList[[i]]
  trainingData_no_y <- trainingData[, !grepl("class", names(trainingData))]
  rfModel <- rfModelList[[i]]
  pairs(trainingData_no_y[trainingData$class == "non-initiation",], 
      upper.panel = NULL, 
      diag.panel = panel.hist, 
      main = "Non-initiation points")
  cat("\n\nERROR RATE\n")
  errorRateDf <- data.frame(rfModel$err.rate[rfModel$ntree,])
  colnames(errorRateDf) <- "error rate"
  cat(paste0(capture.output(errorRateDf), collapse = "\n"))
  cat("\n\nCONFUSION MATRIX\n")
  rownames(rfModel$confusion) <- c("true initiation", "true non-initiation")
  cat(paste0(capture.output(rfModel$confusion), collapse = "\n"))
  cat("\n\nIMPORTANCE\n")
  imp <- importance(rfModel)
  cat(paste0(capture.output(imp), collapse = "\n"))
  varImpPlot(rfModel)
  vars_by_importance <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]
  par(mfrow = c(3,2))
  for (variable in vars_by_importance) {
    partialPlot(rfModel, 
                trainingData, 
                eval(variable), 
                xlab = variable, 
                main = variable, 
                ylim = c(-1, 5))
  }
  par(mfrow = c(1,1))
}
## MODEL  1

## 
## 
## ERROR RATE
##                error rate
## OOB            0.08783784
## initiation     0.12162162
## non-initiation 0.05405405
## 
## CONFUSION MATRIX
##                     initiation non-initiation class.error
## true initiation             65              9  0.12162162
## true non-initiation          4             70  0.05405405
## 
## IMPORTANCE
##              MeanDecreaseGini
## grad_15              4.941346
## pca_15m_48hr         9.297914
## pca_15m_5hr          5.381754
## plan_15             23.742500
## prof_15             30.119729

## MODEL  2

## 
## 
## ERROR RATE
##                error rate
## OOB            0.07432432
## initiation     0.09459459
## non-initiation 0.05405405
## 
## CONFUSION MATRIX
##                     initiation non-initiation class.error
## true initiation             67              7  0.09459459
## true non-initiation          4             70  0.05405405
## 
## IMPORTANCE
##              MeanDecreaseGini
## grad_15              4.507620
## pca_15m_48hr         9.043075
## pca_15m_5hr          6.303770
## plan_15             23.801144
## prof_15             29.892850

## MODEL  3

## 
## 
## ERROR RATE
##                error rate
## OOB            0.08783784
## initiation     0.12162162
## non-initiation 0.05405405
## 
## CONFUSION MATRIX
##                     initiation non-initiation class.error
## true initiation             65              9  0.12162162
## true non-initiation          4             70  0.05405405
## 
## IMPORTANCE
##              MeanDecreaseGini
## grad_15              4.983524
## pca_15m_48hr         9.428937
## pca_15m_5hr          6.825721
## plan_15             21.640659
## prof_15             30.649916

## MODEL  4

## 
## 
## ERROR RATE
##                error rate
## OOB            0.06756757
## initiation     0.09459459
## non-initiation 0.04054054
## 
## CONFUSION MATRIX
##                     initiation non-initiation class.error
## true initiation             67              7  0.09459459
## true non-initiation          3             71  0.04054054
## 
## IMPORTANCE
##              MeanDecreaseGini
## grad_15              4.550788
## pca_15m_48hr        12.688359
## pca_15m_5hr          6.119680
## plan_15             21.808012
## prof_15             28.372322

## MODEL  5

## 
## 
## ERROR RATE
##                error rate
## OOB            0.12162162
## initiation     0.14864865
## non-initiation 0.09459459
## 
## CONFUSION MATRIX
##                     initiation non-initiation class.error
## true initiation             63             11  0.14864865
## true non-initiation          7             67  0.09459459
## 
## IMPORTANCE
##              MeanDecreaseGini
## grad_15              6.523334
## pca_15m_48hr         8.594965
## pca_15m_5hr          5.319498
## plan_15             21.264075
## prof_15             31.772371

Probability Raster

After fitting the random forest models, we can create a probability raster by applying each model to the raster data in the analysis region. We do that for each iteration and use the average of all values.

analysisAreaVarsRaster <- terra::mask(varsRaster, analysisRegionMask)
probRasterList <- lapply(rfModelList, function(model) {
  terra::predict(
    analysisAreaVarsRaster,
    model,
    na.rm = TRUE,
    type = "prob"
  )[["initiation"]]
})
probRasterList <- terra::rast(probRasterList)
mean_prob_raster <- terra::app(probRasterList, mean)
min_prob_raster <- terra::app(probRasterList, min)
max_prob_raster <- terra::app(probRasterList, max)
diff_prob_raster <- max_prob_raster - min_prob_raster
rm(probRasterList)

plotRGB(terrain)
plot(shade, col=grey(0:100/100), add = TRUE)
plot(mean_prob_raster, 
     col = colorRampPalette(c("#56d663", "#ba5123", "#9e2509"))(25), 
     add = TRUE)
legend("left", 
       legend = c(rep(NA, 3), "0.9", rep(NA, 23), "0"), 
       fill = c(rep("transparent", 3), colorRampPalette(c("#9e2509", "#ba5123", "#56d663"))(25)), 
       y.intersp = 0.5, 
       border = "transparent", 
       title = "Mean Probability")

Proportion Raster

We can calculate a proportion raster from the probability raster and use it to see, for example, that 50% of landslides are predicted to occur in the green region plotted below.
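createPropRaster is defined in the helper scripts. Assuming the proportion raster gives, for each cell, the cumulative share of total predicted probability held by cells with probability at or below its own (so that cells with values >= 0.5 jointly contain the top half of the predicted probability mass), a sketch might be:

```r
library(terra)

# Hypothetical sketch; the real createPropRaster helper may define the
# proportion differently. Here a cell's value is the cumulative share of
# total predicted probability over all cells with probability <= its own,
# so cells with values >= 0.5 hold the top half of the probability mass.
sketchPropRaster <- function(probRaster) {
  vals <- terra::values(probRaster)[, 1]
  ord <- order(vals, na.last = NA)                # ascending, drop NAs
  cum <- cumsum(vals[ord]) / sum(vals[ord])       # cumulative share of total
  prop <- rep(NA_real_, length(vals))
  prop[ord] <- cum
  terra::setValues(probRaster, prop)
}
```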

prop_raster <- createPropRaster(mean_prob_raster)

plotRGB(terrain)
plot(shade, col=grey(0:100/100), add = TRUE)
top50 <- terra::app(prop_raster, function(x) x >= 0.5)
plot(top50, add = TRUE, col = c("transparent", "green"))

Alternatively, the other 50% of landslides are predicted to occur in the rest of the region, plotted below; these are the parts of the analysis region with a lower probability of landslides.

plotRGB(terrain)
plot(shade, col=grey(0:100/100), add = TRUE)
plot(top50, add = TRUE, col = c("green", "transparent"))